SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Kaiyu Li1 , Ruixun Liu1 , Xiangyong Cao1✉ , Xueru Bai2 , Feng Zhou2 , Deyu Meng1 , Zhi Wang1

Xi’an Jiaotong University1 Xidian University2

Background

  • Zero-Shot Learning: the model is required to correctly classify samples from new, unseen classes.
    • Seen (Base, Annotated) classes \(C_\text{B}\): Available during training;
    • Unseen (Novel) classes \(C_\text{N}\): Strictly unavailable during training; their classifier is built from pre-defined word embeddings.

Note

In ZSL, new classes are identified solely through pre-defined word embeddings, while the visual-semantic mapping is learned only from \(C_\text{B}\).

Open-Vocabulary Learning is proposed to handle this issue by providing additional vision-language data (e.g., image captions) as auxiliary supervision.

CLIP: a vision-language foundation model with remarkable zero-shot capability, playing an important role in Open Vocabulary Semantic Segmentation (OVSS).

Background (Cont’d)

In practice, CLIP is often employed as an encoder. To exploit the zero-shot generalization capability of CLIP, researchers mainly focus on designing intricate decoders to accommodate pixel-level perception.

CLIP-based OVSS

Motivation

💀 Spatial Resolution vs. Semantic Quality: Deep features often sacrifice spatial resolution for semantic quality. [1]

💀 Global Bias: CLIP was trained by image-level alignment, and the globally aligned features are not well-suited for dense prediction tasks like semantic segmentation, which puts more emphasis on local features. [2]

Note

The prediction head (decoder) is able to upsample LR feature maps into HR predictions. CLIP focuses only on the global \([\mathrm{CLS}]\) token, and even though patch-level tokens can be generated, they are inevitably contaminated by global bias, which is detrimental to dense prediction.

💀 Remote-Sensing-specific issue: Unlike natural images, RS images are particularly sensitive to low-resolution features, so previous solutions designed for natural images are sub-optimal.

Limitations of previous state-of-the-art OVSS methods

Abstract

Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To this end, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, due to the sensitivity of remote sensing images to low-resolution features, distorted target shapes and ill-fitting boundaries are exhibited in the prediction mask. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves an average improvement of 5.8%, 8.2%, 4%, and 15.3% over state-of-the-art methods on the 4 tasks.

Tip

Supplementary explanation about the functionality of SimFeatUp:

  • It needs training on a few images;
  • It is used for upsampling, to restore spatial information for semantic segmentation;
  • Train once, run anywhere (plug-and-play).

Working Pipeline

Training pipeline

Inference pipeline

How to upsample the output

FeatUp is a model-agnostic upsampler that reconstructs high-resolution features from low-resolution ones. [1]

FeatUp Training Architecture

Training paradigm:

  • \(\sigma_{\uparrow}\): Feedforward upsampler. A parameterized generalization of a Joint Bilateral Upsampling (JBU) [3] filter.
  • \(\sigma_{\downarrow}\): Downsampler.

\[\mathcal{L}_{\mathrm{reconstruct}} = \|\boldsymbol{X} - \sigma_{\downarrow}\left(\sigma_{\uparrow}\left(\boldsymbol{X}\right)\right)\|_2^2.\]
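A minimal PyTorch rendering of this objective (the `upsampler`/`downsampler` callables and the extra guidance-image argument are assumptions for illustration, not the authors' exact implementation):

import torch.nn.functional as F

def reconstruction_loss(feats_lr, image, upsampler, downsampler):
    # sigma_up: lift low-resolution features to high resolution (JBU-style, guided by the image)
    feats_hr = upsampler(feats_lr, image)
    # sigma_down: learnable downsampler back to the original feature resolution
    feats_rec = downsampler(feats_hr)
    # L_reconstruct = || X - sigma_down(sigma_up(X)) ||_2^2
    return F.mse_loss(feats_rec, feats_lr)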

How to upsample the output (Cont’d)

A major objective of this study is to develop a plug-and-play module capable of upsampling the encoder’s output without disrupting the original processing pipeline.

If upsampling is performed at the end of the encoding stage:

  • FeatUp is trained with CLIP as a frozen backbone, where all Transformer layers employ standard vanilla attention (i.e., \(\mathrm{softmax}\left(\frac{\boldsymbol{q}\;\boldsymbol{k}^\mathsf{T}}{\sqrt{d}}\right)\;\boldsymbol{v}\)). However, during inference, the last layer usually undergoes significant modifications in practice, which introduce potential inconsistencies, including:
    • altering the attention mechanism (i.e., using a modulated attention function);
    • removing the feed-forward network (FFN);
    • and omitting layer normalization.
  • The resulting feature maps fed into the decoder will have a larger spatial resolution, necessitating structural modifications to the decoder to accommodate this mismatch, or another projection before decoding.

To address this, the authors propose performing upsampling at an earlier stage of the encoder. Specifically, given an encoder with \(N\) layers, the output of the \((N-1)\)-th layer, denoted as \(X_{N-1}\), is subjected to a projection operation in advance, followed by upsampling. The resulting feature map is then used as input to the final layer of the encoder:

\[X'_{N-1} = \sigma_{\uparrow}\left(\mathrm{Proj}\left(X_{N-1}\right)\right).\]
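A sketch of where this hooks into a ViT-style CLIP encoder (attribute names like `embed`, `blocks`, and `proj` are assumptions, and token-to-map reshaping plus [CLS]-token handling are omitted; only the ordering of project, upsample, then run the last layer follows the formula above):

def encode_with_early_upsampling(vit, image, upsampler):
    x = vit.embed(image)                 # patch embedding + positional encoding
    for blk in vit.blocks[:-1]:          # run layers 1 .. N-1 to obtain X_{N-1}
        x = blk(x)
    x = vit.proj(x)                      # Proj(X_{N-1})
    x = upsampler(x, image)              # X'_{N-1} = sigma_up(Proj(X_{N-1}))
    return vit.blocks[-1](x)             # the (possibly modified) last layer sees HR tokens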

The core of FeatUp — JBU

Joint Bilateral Upsampling (JBU) was introduced in 2007, predating the widespread adoption of deep learning techniques.

In scenarios where the raw high-resolution image \(\boldsymbol{\tilde{I}}\) exceeds the computational capacity for direct processing, a downsampling operation is applied to reduce its size, yielding a low-resolution representation \(\boldsymbol{S}\). Following the completion of processing, the low-resolution output \(\boldsymbol{S}\) must be upsampled to reconstruct the high-resolution solution \(\boldsymbol{\tilde{S}}\).

For a given position \(\boldsymbol{p}\) in \(\boldsymbol{\tilde{S}}\), the high-resolution output is computed as:

\[\tilde{\boldsymbol{S}}_{\boldsymbol{p}} = \frac{1}{k_{\boldsymbol{p}}}\sum_{\boldsymbol{q}_\downarrow \in \Omega}\boldsymbol{S}_{\boldsymbol{q}_\downarrow}\;f\left(\|\boldsymbol{p}_\downarrow - \boldsymbol{q}_\downarrow\|\right)\;g\left(\|\tilde{\boldsymbol{I}}_{\boldsymbol{p}} - \tilde{\boldsymbol{I}}_{\boldsymbol{q}}\|\right),\]

where:

  • \(f\) represents the spatial filter kernel (spatial similarity), e.g., a Gaussian function centered at \(\boldsymbol{p}_\downarrow\);
  • \(g\) denotes the range filter kernel (range similarity), e.g., a Gaussian function centered at \(\tilde{\boldsymbol{I}}_{\boldsymbol{p}}\);
  • \(\Omega\) defines the spatial support, corresponding to the kernel size;
  • \(k_{\boldsymbol{p}}\) is the normalization factor, i.e., the sum of the filter weights over \(\Omega\).

This formulation ensures that the upsampling process preserves both spatial and intensity-based consistency, leveraging the joint bilateral filtering mechanism.
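To make the formula concrete, here is a direct, unoptimized Python rendering for a single output position, assuming Gaussian \(f\) and \(g\), a single-channel guidance image and LR feature map, and an integer upsampling factor (all illustrative choices, not the original JBU code):

import numpy as np

def jbu_pixel(S_lr, I_hr, p, radius=2, sigma_s=1.0, sigma_r=0.1):
    # Classic JBU for one high-resolution position p = (y, x); illustrative only.
    scale = I_hr.shape[0] // S_lr.shape[0]          # integer upsampling factor (assumed)
    py, px = p
    p_lo = (py / scale, px / scale)                 # p_down: p in low-resolution coordinates
    acc, k_p = 0.0, 0.0
    for qy in range(int(p_lo[0]) - radius, int(p_lo[0]) + radius + 1):
        for qx in range(int(p_lo[1]) - radius, int(p_lo[1]) + radius + 1):
            if not (0 <= qy < S_lr.shape[0] and 0 <= qx < S_lr.shape[1]):
                continue                            # skip neighbours outside the LR solution
            # f: spatial similarity between p_down and q_down
            f = np.exp(-((p_lo[0] - qy) ** 2 + (p_lo[1] - qx) ** 2) / (2 * sigma_s ** 2))
            # g: range similarity between the guidance image at p and at the HR pixel under q_down
            g = np.exp(-(I_hr[py, px] - I_hr[qy * scale, qx * scale]) ** 2 / (2 * sigma_r ** 2))
            acc += S_lr[qy, qx] * f * g
            k_p += f * g
    return acc / k_p                                # k_p normalizes the filter weights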

FeatUp’s implementation:

Spatial filter kernel: \(k_\mathrm{spatial}(\boldsymbol{p}, \boldsymbol{q}) = \exp\left(\frac{-\|\boldsymbol{p} - \boldsymbol{q}\|_2^2}{2\;\tau_\mathrm{spatial}^2}\right).\)

Range filter kernel: \(k_\mathrm{range}(\boldsymbol{p}, \boldsymbol{q}) = \mathrm{softmax}_{\left(a, b\right) \in \Omega}\left(\frac{1}{\tau_{\mathrm{range}}^2}\;\mathrm{MLP}\left(G\left[i, j\right]\right)\cdot\mathrm{MLP}\left(G\left[a, b\right]\right)\right),\) where \(G\) is the guidance signal from the original image, \((i, j)\) indexes the query position \(\boldsymbol{p}\), and \((a, b)\) ranges over the window \(\Omega\).
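A sketch of how these two kernels might be computed and combined per window (PyTorch; the tensor shapes, the assumption that the MLP over \(G\) has already been applied, and the unfold-based neighbourhood gathering are illustrative choices, not the exact FeatUp code):

import torch
import torch.nn.functional as F

def jbu_weights(guidance_feats, k=7, tau_spatial=1.0, tau_range=1.0):
    # guidance_feats: (B, C, H, W), already passed through the MLP over the guidance image G
    B, C, H, W = guidance_feats.shape
    pad = k // 2
    # Range kernel: dot product between each centre pixel and its k*k neighbours,
    # softmax-normalized over the window (the softmax over (a, b) in Omega above)
    neigh = F.unfold(guidance_feats, k, padding=pad)                     # (B, C*k*k, H*W)
    neigh = neigh.view(B, C, k * k, H * W)
    center = guidance_feats.reshape(B, C, 1, H * W)
    k_range = ((center * neigh).sum(1) / tau_range ** 2).softmax(dim=1)  # (B, k*k, H*W)
    # Spatial kernel: fixed Gaussian over the offsets inside the window
    ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
    d2 = ((ys - pad) ** 2 + (xs - pad) ** 2).float()
    k_spatial = torch.exp(-d2 / (2 * tau_spatial ** 2)).reshape(1, k * k, 1)
    # Joint weights, renormalized so each window sums to one
    w = k_range * k_spatial
    return w / w.sum(dim=1, keepdim=True)                                # (B, k*k, H*W)

These per-window weights would then be applied to the low-resolution features gathered under each high-resolution position, exactly as in the JBU sum above.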

More about FeatUp

[1] introduced two distinct architectures for the FeatUp upsampler: one leveraging the previously discussed Joint Bilateral Upsampling (JBU) technique, and the other employing an implicit deep network.

Two FeatUp architectures

The JBU-based upsampler utilizes a stack of parameterized JBU modules (each module has its independent parameters) to reconstruct high-resolution feature maps. Specifically, given an original image \(\boldsymbol{x}\) and its corresponding low-resolution feature map \(f(\boldsymbol{x})\), the high-resolution feature map \(\boldsymbol{F}_\mathrm{hr}\) is obtained through the following iterative process:

\[\boldsymbol{F}_\mathrm{hr} = \left(\mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \mathrm{JBU}\left(\cdot, \boldsymbol{x}\right) \circ \cdots\right)\left(f\left(\boldsymbol{x}\right)\right).\]
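In code, the stack is simply a composition of modules with separate weights (a sketch; `JBU` stands for a hypothetical parameterized JBU module such as the one whose weights are sketched above, and the 2x-per-module factor is an assumption):

import torch.nn as nn

class JBUStack(nn.Module):
    # Four independently parameterized JBU modules, each assumed to upsample by 2x (16x total)
    def __init__(self, dim):
        super().__init__()
        self.ups = nn.ModuleList([JBU(dim) for _ in range(4)])   # hypothetical JBU module

    def forward(self, feats_lr, image):
        # F_hr = (JBU(., x) o JBU(., x) o ...)(f(x))
        for up in self.ups:
            feats_lr = up(feats_lr, image)
        return feats_lr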

Important

JBU-based upsamplers impose strong spatial priors to accurately recover lost spatial information, whereas implicit upsamplers recover it by learning high-quality features with a per-image implicit network.

Improving FeatUp

Both \(\sigma_{\uparrow}\) and \(\sigma_{\downarrow}\) are parameterized and learnable, which introduces a relatively weak constraint. This flexibility allows \(\sigma_{\uparrow}\) and \(\sigma_{\downarrow}\) to be optimized in a manner that minimizes the overall loss. However, this optimization may lead to high-resolution (HR) images generated by \(\sigma_{\uparrow}\) that are inconsistent with their corresponding low-resolution (LR) counterparts.

To address this issue, an additional image-level loss is introduced:

\[\mathcal{L}_{\mathrm{image}} = \|\boldsymbol{I} - \mathrm{CRN}\left(\sigma_{\uparrow}\left(\mathcal{O}\left[1:hw + 1\right]\right)\right)\|_2^2,\]

where \(\mathrm{CRN}\) denotes a lightweight content retention network designed to preserve content consistency.

The total loss function for constructing SimFeatUp is then defined as:

\[\mathcal{L} = \mathcal{L}_{\mathrm{reconstruct}} + \mu\;\mathcal{L}_{\mathrm{image}}.\]

With or without image reconstruction

Specifically, the CRN consists of two 2D convolutional layers (with activations) followed by a Tanh layer, where the Tanh constrains the output to \([-1, 1]\), cf. VAEs [4].
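A minimal sketch of such a CRN and of \(\mathcal{L}_{\mathrm{image}}\) (the hidden width, kernel sizes, and ReLU choice are assumptions; only the two-convolutions-plus-Tanh structure follows the description above):

import torch.nn as nn
import torch.nn.functional as F

class CRN(nn.Module):
    # Content retention network: two convs with an activation, then Tanh,
    # mapping HR features back to an RGB image in [-1, 1]
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, feats_hr):
        return self.net(feats_hr)

def image_loss(image, feats_hr, crn):
    # L_image = || I - CRN(sigma_up(O[1:hw+1])) ||_2^2, with I normalized to [-1, 1]
    return F.mse_loss(crn(feats_hr), image)

The total SimFeatUp objective is then `rec_loss + mu * img_loss`, matching \(\mathcal{L} = \mathcal{L}_{\mathrm{reconstruct}} + \mu\;\mathcal{L}_{\mathrm{image}}\) above.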

Improving FeatUp (Cont’d)

JBU_Stack \(\rightarrow\) JBU_One

The original JBU-based FeatUp has too many parameters, because independently parameterized JBU modules are stacked several times (not a problem in training-required settings). This introduces too much uncertainty in the training-free setting, since each JBU's behavior is hard to control. With this insight, the authors simplify JBU_Stack to JBU_One, i.e., only one parameterized JBU module is used for upsampling; if more upsampling is required, JBU_One can simply be executed several times, as sketched below.
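Contrasted with the JBU_Stack sketch above, JBU_One keeps one set of JBU parameters and reapplies it (again a sketch using the hypothetical `JBU` module; the four-pass, 2x-per-pass setting is an assumption):

import torch.nn as nn

class JBUOne(nn.Module):
    # A single shared parameterized JBU module, reused whenever more upsampling is needed
    def __init__(self, dim):
        super().__init__()
        self.up = JBU(dim)                    # one module instead of four independent ones

    def forward(self, feats_lr, image, times=4):
        for _ in range(times):                # e.g. four passes for 16x upsampling
            feats_lr = self.up(feats_lr, image)
        return feats_lr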

Important

In the ablation study, it is found that JBU_One reduces the number of parameters by nearly 4 times while delivering a slight IoU gain.

Larger kernel size

In view of the widely varying object scales in remote sensing images, the upsampling kernel is enlarged from \(7 \times 7\) to \(11 \times 11\). The concern is that a larger kernel may bring in more irrelevant context, but \(k_\mathrm{spatial}\) decays with distance: the larger \(\|\boldsymbol{p} - \boldsymbol{q}\|\) is, the smaller the weight it contributes to the resulting HR features (see the quick check below).
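As a rough sanity check (assuming, say, \(\tau_\mathrm{spatial} = 2\) in patch units, an illustrative value only):

\[\frac{k_\mathrm{spatial}\left(\|\boldsymbol{p} - \boldsymbol{q}\| = 5\right)}{k_\mathrm{spatial}\left(\|\boldsymbol{p} - \boldsymbol{q}\| = 1\right)} = \exp\left(\frac{-(25 - 1)}{2 \cdot 2^2}\right) = \exp(-3) \approx 0.05,\]

so positions near the edge of the enlarged \(11 \times 11\) window contribute little unless the range kernel also deems them similar.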

Why does CLIP bring “global bias”?

In the last layer of CLIP, given input: \(\boldsymbol{X} = \left[\boldsymbol{x}_\mathrm{cls}, \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_{hw}\right]^\mathsf{T} \in \mathbb{R}^{(hw + 1) \times d}\);

  • Query vector: \(\boldsymbol{q} = \mathrm{Embedding}_{\boldsymbol{q}}\left(\boldsymbol{X}\right)\);
  • Key vector: \(\boldsymbol{k} = \mathrm{Embedding}_{\boldsymbol{k}}\left(\boldsymbol{X}\right)\);
  • Value vector: \(\boldsymbol{v} = \mathrm{Embedding}_{\boldsymbol{v}}\left(\boldsymbol{X}\right)\).

\[\boldsymbol{y} = \boldsymbol{X} + \mathrm{softmax}\left(\frac{\boldsymbol{q}\;\boldsymbol{k}^\mathsf{T}}{\sqrt{d}}\right)\;\boldsymbol{v}.\]

\[\boldsymbol{z} = \boldsymbol{y} + \mathrm{FeedForwardNet}\left(\mathrm{LayerNormalization}\left(\boldsymbol{y}\right)\right).\]

Output:

\[\mathcal{O} = \mathrm{Proj}\left(\boldsymbol{z}\right) = \left[\boldsymbol{o}_\mathrm{cls}, \boldsymbol{o}_1, \cdots, \boldsymbol{o}_{hw}\right]^\mathsf{T} \in \mathbb{R}^{(hw + 1) \times c}.\]

The learnable global token \(\boldsymbol{x}_\mathrm{cls}\) aggregates information over the whole sequence, so global and local information become entangled in the output tokens. This is negligible for classification tasks, but detrimental to dense prediction tasks like segmentation.

What is the culprit of “global bias”?

Note

Before ClearCLIP [5], self-attention was widely believed to be the culprit of the “noise”.

  • MaskCLIP [6]: Sets query-key attention …
  • CLIPSurgery [7]: Argues that the value-value attention …
  • SCLIP [8]: Combines the query-query and key-key attention …
  • ClearCLIP [5]: We are surprised to find that the residual connection, proposed by ResNet and commonly employed in transformer architectures, has a significant effect on the adaptation of CLIP to OVSS.

Some Insights

Prior studies have sought to mitigate the “noise” inherent in CLIP through three primary strategies, all of which involve modifications to the final layer of the model:

  • Eliminating the residual connection;
  • Substituting vanilla attention with self-self attention;
  • Omitting the Feed Forward Network (FFN).

Tip

Modulated attention in SegEarth-OV (self-self attention):

\[\mathrm{M\text{-}SA} = \sum_{\boldsymbol{i} \in \left\{\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}\right\}}\mathrm{softmax}\left(\frac{\boldsymbol{i}\;\boldsymbol{i}^\mathsf{T}}{\sqrt{d}}\right)\;\boldsymbol{v}.\]
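A single-head sketch of this modulated attention and of the modified last block (the embedding/projection callables are placeholders, and the residual connection and FFN are dropped as listed above):

import torch

def modulated_self_attn(q, k, v, d):
    # M-SA = sum over i in {q, k, v} of softmax(i i^T / sqrt(d)) v
    out = torch.zeros_like(v)
    for i in (q, k, v):
        out = out + torch.softmax(i @ i.transpose(-2, -1) / d ** 0.5, dim=-1) @ v
    return out

def last_block(x, embed_q, embed_k, embed_v, proj, d):
    q, k, v = embed_q(x), embed_k(x), embed_v(x)
    y = modulated_self_attn(q, k, v, d)   # no residual connection, no FFN
    return proj(y)                        # O = Proj(z) = [o_cls, o_1, ..., o_hw]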

Not Enough

How does global bias “contaminate” segmentation results?

The image is recognized as “building”, which is reasonable because buildings cover the largest area in the image.

However, the highly responsive regions are not limited to buildings: some roads and pavements are also activated, which indicates that the global bias contaminates the local patch tokens.

Alleviating Global Bias

Just subtract the global \(\left[\mathrm{CLS}\right]\) token:

\[\hat{\mathcal{O}} = \mathcal{O}\left[1 : hw + 1\right] - \lambda\;\mathcal{O}\left[0\right],\]

where \(\lambda\) is an intensity factor.

Tip

\(\mathcal{O}\left[1 : hw + 1\right]\) retrieves the patch tokens \(\boldsymbol{o}_1, \cdots, \boldsymbol{o}_{hw}\), and \(\mathcal{O}\left[0\right]\) is the global \([\mathrm{CLS}]\) token \(\boldsymbol{o}_\mathrm{cls}\).

class SegEarthSegmentation(BaseSegmentor):
    # ...
    def forward_feature(self, img, logit_size=None):
        # ...
        # Global bias alleviation: with a negative cls_token_lambda, adding
        # cls_logits * cls_token_lambda subtracts lambda * O[0] from the patch logits
        if self.output_cls_token:
            logits = logits + cls_logits * self.cls_token_lambda
        # ...

# In demo.py
model = SegEarthSegmentation(
    cls_token_lambda=-0.3,   # i.e., subtract 0.3 x the [CLS] response (lambda = 0.3)
    # ...
)

Experiments

Single-Class Extraction

pretrain: CLIP (ViT-B/16)

Comparison with previous SOTAs

Note

Images of larger size allow the upsampler to preserve more spatial information.

Experiments (Cont’d)

Plug-and-Play

Plug-and-play

Experiments (Cont’d)

Ablation Study

Ablation study
  • Baseline: Remove the FFN and residual connection of the last Transformer block in CLIP, and modulate vanilla attention to self-self attention (i.e., \(\mathrm{M\text{-}SA}\)).
  • FeatUp (CLIP): Use FeatUp to upsample the output of CLIP.
  • FeatUp (MaskCLIP): Use FeatUp to upsample the output of MaskCLIP (ECCV 2022).
  • “X”\(\uparrow\): upsample before the last layer.
  • “+ RS Data”: Train SimFeatUp on remote sensing images.
  • “JBU_One”: Replace JBU_Stack with JBU_One.
  • “Rec. Image”: Use the CRN to reconstruct the image (i.e., add \(\mathcal{L}_{\mathrm{image}}\)).
  • “Alleviate Global Bias”: Subtract the global [CLS] token from the patch tokens.
  • “Large Kernel”: Use larger kernel in SimFeatUp.

Contributions

This study advances the application of open-vocabulary semantic segmentation methods, originally designed for natural images, to the domain of remote sensing by addressing the critical challenges unique to this context. It adapts and integrates existing OVSS methodologies to effectively handle remote sensing segmentation tasks for the first time. The primary contributions of this work are as follows:

  • Pioneering the adaptation of OVSS techniques to remote sensing applications;
  • Proposing a plug-and-play SimFeatUp module for upsampling feature maps generated by open-vocabulary models, thereby enhancing spatial resolution;
  • Providing a novel insight into the “global bias” issue inherent in vision-language models and introducing a global bias subtraction mechanism to mitigate its adverse effects.

References

[1]
S. Fu, M. Hamilton, L. Brandt, A. Feldman, Z. Zhang, and W. T. Freeman, “Featup: A model-agnostic framework for features at any resolution,” arXiv preprint arXiv:2403.10516, 2024.
[2]
T. Shao, Z. Tian, H. Zhao, and J. Su, “Explore the potential of clip for training-free open vocabulary semantic segmentation,” in European conference on computer vision, Springer, 2024, pp. 139–156.
[3]
J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics (ToG), vol. 26, no. 3, pp. 96–es, 2007.
[4]
D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), Banff, Canada, 2014.
[5]
M. Lan, C. Chen, Y. Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” in European conference on computer vision, Springer, 2024, pp. 143–160.
[6]
C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European conference on computer vision, Springer, 2022, pp. 696–712.
[7]
Y. Li, H. Wang, Y. Duan, and X. Li, “Clip surgery for better explainability with enhancement in open-vocabulary tasks,” arXiv preprint, 2023.
[8]
F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” in European conference on computer vision, Springer, 2024, pp. 315–332.